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O ! Abstract 



A deterministic mutation-selection model in the sequence space approach is inves- 
tigated. Genotypes are identified with two-letter sequences. Mutation is modelled 
as a Markov process, fitness functions are of Hopfield type, where the fitness of a se- 
quence is determined by the Hamming distances to a number of predefined patterns. 
\ Using a maximum principle for the population mean fitness in equilibrium, the error 

threshold phenomenon is studied for quadratic Hopfield-type fitness functions with 
small numbers of patterns. Different from previous investigations of the Hopfield 
model, the system shows error threshold behaviour not for all fitness functions, but 
' only for certain parameter values. 
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1 Introduction 



Population genetics is concerned with the investigation of the genetic struc- 
ture of populations, which is influenced by evolutionary factors such as mu- 
tation, selection, recombination, migration and ge netic drift. For exce ll ent re - 
views of the theoretical a spects of this field, see 
Crow and Kimural (|l97nt l: lEwenl (|2004l b 



Baake and Gabriell t00(t 
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In this paper, the antagonistic interplay of mutation and selection shall be 
investigated, with mutation generating the genetic variation upon which se- 
lection can act. Pure mutation-selection models exclude genetic drift and are 
therefore deterministic models, and accurate only in the limit of an infinite 
population size (for a review, see Biirgeii EoOO). A further simplification taken 



here is to consider only haploid populations, where the genetic material exists 
in one copy only in each cell. However, the equations used here to describe 
evolution apply as well to diploid populations without dominance. 

For the modelling of the types co nsidered, the sequence space approach is 



used, which has first been used by lEigenl ([19711 1 to model the structure of 



the DNA, where individuals are taken to be sequences. Here, the sequences 
shall be written in a two- letter alphabet, thus simplifying the full four- letter 
structure of DNA sequences. In this approach, the modelling is based on the 
microscopic level, at which the mutations occur, hence the mutation process is 
fairly straightforward to model. However, the modelling of selection is a more 
challenging task, as selection acts on the phenotype, and the mapping from 
genotype to phenot ype is by no means simple . To this end, the concept of the 



fitness landscape ([Kaufiman and Levinl . Il987l ) is introduced as a function on 



the sequence space, assigning to each possible genotype a fitness value which 
determines the reproduction rate. Apart from the problem that a realistic 
fitness landscape would have to be highly complex (too complex for a mathe- 
matical treatment), there is also very limited information available concerning 
the nature of realistic fitness functions. Therefore, the modelling of fitness is 
bound by feasibility, trying to mimic general features that are thought to be 
essential for realistic fitness landscapes such as the degree of ruggedness. 

A very common type of fitness functions is the class of permutation-invariant 
fitness functions, where the fitness of a sequence is determined by the number 
of mutations it carries compared to the wild-type, but not on the locations of 
the mutations within the sequence. Although this model describes the accu- 
mulation of small mutational effects surprisingly well, it is a simplistic model 
that lacks a certain degree of rugged ness that is thought to be an important 
feature of realistic fitness landscapes f Ba ake and Gabriel Eoooh . 



In this paper, Hopfield-type fitness functions ( Hopfield . 1982h are treated as a 



more complex model. Here, the fitness of a sequence is not only determined by 
the number of mutations compared to one reference sequence, but to a number 
of predefined sequences, the patterns. This yields a class of fitness landscapes 
that contain a higher degree of ruggedness, which can be tuned by the number 
of patterns chosen. While this can still be treated with the methods used here, 
it is a definite improvement on the restriction of permutation-invariant fitness 
functions. 

Particular interest is taken in the phenomenon of mutation driven error thresh- 
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olds, where the population in equilibrium changes from viable to non-viable 
within a narrow regime of mutation rates. In this paper, a few examples 
of Hopfield-type fitness functions are investigated with respect to the error 
threshold phenomenon. 

Section 2 introduces the basic mutation-selection model with its main observ- 
ables. In section 3, the model is applied to the sequence space approach, formu- 
lating the mutation and fitness models explicitly. Sections 4 and 5 present the 
method, which relies on a lumping of the large number of sequences into classes 
on which a coarser mutation-selection process is formulated. This lumping is 
necessary to formulate a simple maximum principle to determine the popula- 
tion mean fitness in equilibrium. In section 6, this maximum principle is used 
to investigate some examples of Hopfield-type fitness functions with respect 
to the error threshold phenomenon. 



2 The mutation— selection model 



The model used here (as detailed below) is a p ure mutation-sele c tion m odel 
in a time-continuous form ulation as used by 
Garske and Grimrnl ()2004bl lah. for instance. 



Hermisson et al. (120021) and 



Population. The evolution of a population where the only evolutionary 
forces are mutation and selection is considered, thus excluding other factors 
such as drift or recombination for instance. Individuals in the population shall 
be assigned a type i from the finite type space &. The population at any time t 
is described by the population distribution p(t) , a vector of dimension |©|, the 
cardinality of the type space. An entry Pi(t) gives the fraction of individuals 
in the population that are of type i. Thus the population is normalised such 
that Y,iPi{t) = 1- 



Evolutionary processes. The evolutionary processes that occur are birth, 
death and mutation events. Birth and death events occur with rates 6j and di 
that depend on the type i of the individual in question, and taken together, 
they give the effective reproductive rate, or fitness as T{ — hi — d{. Mutation 
from type % to type j depends on both initial and final type and happens with 
rate m^. These rates are conveniently collected in square matrices 1Z and Ai 
of dimension \&\, where the reproduction or fitness matrix 1Z with entries 
Ti is diagonal. The off-diagonal entries of the mutation matrix M. are given 
by the mutation rates rriji, and as mutation does not change the number of 
individuals, the diagonal entries of M. are chosen such that Ada = — -Mji, 
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which makes Ai a Markov generator. The time evolution operator Ji is given 
by the sum of reproduction and mutation matrix, 7i = 1Z + M.. 



Deterministic evolution equation. In the deterministic limit of an infi- 
nite population size, the evolution of the population is governed by the evo- 
lution equation 

pit) = [H - f(t)l]p(t) i (1) 
where f(t) = J2i r iPi{t) is the population mean fitness. The term with f is 
needed to preserve the normalisation of the population. Note that this term 
makes the evolution equation (1) nonlinear. 



Equilibrium. The main interest focuses on the equilibrium, i.e., the be- 
haviour if p = 0, which is attained for t — > oo. All equilibrium quantities shall 
be denoted by omitting the argument t, for instance the equilibrium popu- 
lation distribution is p. In equilibrium, the evolution equation (1) becomes 
an eigenvalue equation for T~C, with leading eigenvalue f and corresponding 
eigenvector p. If M. is irreducible, a s shall be ass umed throughout, Perron- 
Frobenius theory (see, for instance, EH E*0, appendix) applies, which 
guarantees that the leading eigenvalue r of 7i is non-degenerate and the cor- 
responding right eigenvector p is strictly positive, which implies that it can 
be normalised as a probability distribution. 



Ancestral distribution. Similarly to the population distribution p, there 
is also another important distribution in this model, namely the ancestral 
distribution a(r,t). Consider the population at time t + r, but count each 
individual not as its current type, but as the type its ancestor had at time t. 
Thus an entry of the ancestral distribution a^r, t) determines the fraction of 
the population at t + r whose ancestor at time t was of type i. In the limit 
T,t — » oo, this also ap proaches an equilibrium distribution a. As shown by 



Hermisson et al.1 (2002), the equilibrium ancestral distribution can be obtained 
as a product of the left and right PF (Perron-Frobenius) eigenvectors z and 
p of the time-evolution operator Ti as = ZiPi, where z is normalised such 
that Yli a i — 1- 



Population and ancestral means. Any function on the type space given, 
say, by / = (fi)ie& can be averaged with respect to the population or the 
ancestral distribution distribution. The population mean of / is given by 

/(*) = E/iP*(*). ( 2 ) 
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whereas the ancestral mean is 



f(-r,t) = ^2fMT,t). (3) 

Note that the time-dependence of the means only comes from the distribu- 
tion, whereas the function / is considered constant in time. In equilibrium, 
time dependence is again omitted such that the equilibrium population and 
ancestral means are denoted / and /, respectively. An important example of 
the population mean is the population mean fitness f(t) from equation (1). 



3 Sequences as types 



In the previous section, the types are a rather abstract concept. In order 
to formulate the particular mutation and fitness models, they shall now b e 
specified as sequences, mimicking the structure of the DNA (cf. lEigen . 1971 ). 



For simplicity, only two-state sequences are considered, i.e., sequences that 
have at each site one out of two possible entries. However, the method used 
here can imm ediately be generalised to a more realistic four-state model (see 
Garskd . li)05h . 



The types therefore are associated with sequences cr = a\a 2 . . .cr N of fixed 
length N, written in the alphabet A = {0, 1}, thus <r Q G A for a = 1, . . . , N. 
This means that there are 2 N different sequences, and thus the type space (or 
sequence space) & has cardinality \&\ = 2 N . 



3.1 Mutation model 



A simple mutation model that neglects any processes changing the length of 
the sequence, such as deletions or insertions, is used. Mutations are modelled 
as point processes, where an arbitrary site a a is switched with rate /x, such 
that the mutation rate between sequences that differ only in one particular site 
is given by fi/N. Sequences that differ in more than one site cannot mutate 
into one another within a single mu tational step. This is kno wn as the single 
step mutation model, introduced by Ohta and Kimural (|l973| ). 



The mutation mod el defines a neighbourhood in the sequence space 

f Reidvs and Stadler . 2002). A convenient measure for the distance between 



sequences is the Hamming distance dnicr^ cr'), which counts the number o f 
sites at which the sequences cr and cr' differ f Hamming . 19501 van Lint . 1982). 
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With this, the mutation matrix is explicitly given as 



U/N if^(<r,0 = l, 

M tra > = h ifd H (<r,<r')>l, (4) 

[— yU if <J = <j' . 

The diagonal entry is chosen such that Ai fulfils the Markov condition 



3.2 Fitness functions 



A rather simple, though commonly used type of fitness function is the permu- 
tation-invariant fitness. There, the fitness of a sequence depends only on the 
number of mutations it has compared to a reference type, not on their position 
along the sequence. Thus fitness is a function of the Hamming distance to the 
reference sequence, which is usually chosen as the wild-type. The Hamming 
distance to the wild-type is also called the mutational distance d. 

A non-permutation-invariant fitness that contains some ruggedness, but is 
simple enough to be dealt with in this framework, is the Hopfield-ty pe fitness. 



a spec ial type of spin-glass model, which has been introduced by iHopfield 



(1982) as a model for neural networks. Instead of comparing a sequence only to 



the wild-type, as it is done for the permutation-invariant fitness, the Hopfield- 
type fitness of a sequence is determined by its Hamming distances to p + 1 
reference sequences, the patterns £ q , q = 0, . . . ,p. The Hopfield-type fitness 
shall be defined in terms of the specific distances w q , which are the Hamming 
distances with respect to the patterns, 



w' = d H (F,o-), (5) 

and thus the fitness is given as 

r-a = r tT ((w q ) q= o ! ... tP ) . (6) 

Note that in the case of a single pattern (p = 0), this yields again a 
permutation-invariant fitness. 
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4 Lumping for the Hopfield-type fitness 



4-1 The general lumping procedure 



One problem of the sequence space approach is the large number of types, 
which grows exponentially with the sequence length N, \&\ — 2 N . The time- 
evolution operator TC is a matrix of size \&\ x \&\, and in this set-up one 
is interested in its leading eigenvalue f and the corresponding right and left 
eigenvectors p and z. 

The relevant sequence length depends on the particular application one has in 
mind, but it is typically rather long. If one aims to model the whole genome of 
a virus or a bacterium, N has to be in the region of iV « 10 6 , but even a single 
gene has of the order of N « 10 3 base pairs. These values lead to matrices of 
a size that makes the eigenvalues and eigenvectors inaccessible. 

For some types of fitness functions, this problem can be reduced by lumping 
together types into classes of types, and considering the new process on a 
reduced sequence space, which contains the classes rather than the individual 
types. Under certain circumstances, mutation is described as a Markov process 
in the emerging lumped process as well, such that this process is accessible 
to Markov process methods, and the framework developed in section 2 can 
directly be applied to the lumped system. 

The lumping of th e mutation process is a s tandard procedure i n the theory 
of Ma rkov chains ( Kemenv and Snell 1960L chapter 6), see also Baake et al 



(2005) for an application to mutation-selection models. This lumping leads 



to a meaningful mutation-selection model on the reduced type space, if all 
sequences lumped together into one class have the same fitness. 



It is possible to lump the Markov chain given by the mutation matrix M. with 

• T 

state space (5 with respect to a particular partition & = [j k=0 & k , if and only 
if for each pair & k , ©i the cumulative mutation rates 

«6<,i := Yl M Ji ( 7 ) 



from type i G &k into &e, are identical for alH £ cf. the example shown in 
figure 1. In this case, the lumped process, with r + 1 states &n & r and mu- 
tatio n rates ue e ,i for any i 6 & k , is again a Markov chain ([Kemenv and Snell, 
1960L theorem 6.3.2). 
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Fig. 1. Visualisation of the compatibility with lumping: Consider two classes &k and 
&£. The mutation rates from the types in 6^ to the types in (2>£ (given next to the 
arrows) are compatible with a lumping with respect to 6^ and <&£, because the sum 
of the mutation rates from type i\ to all types in &i is given by tie^u = 1 + 1 + 2 = 4, 
which is identical with those from 12, u& lt i 2 = 3 + + 1 = 4. 

4-2 Defining the partial distances 



Whereas in the case of a permutation-invariant fitness function, the lumping 
procedure is fairly simple, collecting all sequences with the same Hamming 
distance to the wild-type into classes, and considering cumulative mutation 
rates between these classes, the lumping for the Hopfield-type fitness is some- 
what more complex. For a two-state model w ith Hopfield fitness, it has been 



performed for instance hv lRnake et nil and tins shall be recollected in 



the remainder of this section. 

First, the quantities with respect to which the lumping shall be performed 
must be denned. To this end, consider as an example the case of sequence 
length N = 12 with three patterns (p = 2), and let the patterns £ 9 be collected 
in a (p + 1) x iV matrix such that the qth row of £ is pattern £ q . Without 
loss of generality, the pattern £° can always be chosen as £° = 000 ... 0. Let 
the patterns in this example be given as 



^0 0000000000 0^ 



101101011100 
100011111110 



Note that there are only 2 P different types of sites (corresponding to the 
columns of £). These are collected in a (p + 1) x TP matrix p, the columns of 
which correspond to the possible types of sites in the matrix of patterns £. 
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For the case p — 2, this 3x4 matrix is given as 



^0 0^ 



P = 



11 
10 1 



(9) 



Using the column vectors p v of the matrix p, the patterns given in equation 
(8) can alternatively be expressed as 



£ = (P4> Pl> P3> P3> P2) P4) P2, P4 ? P4> P4> P2> Pi) , 



(10) 



classifying the sites into 2 P classes according to which of the column vectors 
p v of p coincides with the column vector £ a of the patterns £ at site a. 

Let A = {1, . . . , N} be the index set of sites, with a partition into 2 P subsets 
A v induced by the patterns £ such that 



• 2 p 

A = U = A > 



with 



a G A, 



Pv 



(11) 



:i2) 



Patterns can be characterised by the number of sites N v = \A V \ in each subset. 
The example patterns (8) can therefore be described by 

Ax = {2, 12}, A 2 = {5,7,11}, A 3 = {3,4}, A 4 = {1, 6, 8, 9, 10} , 
jV 1 = 2, N 2 = 3, N 3 = 2, N 4 = 5. (13) 



Considering only subsequences cr v := {cr a ) ae A v , the partial distances d v of a 
sequence cr with respect to the pattern £° are defined as the Hamming distance 
between cr and £°, restricted to the subsequence cr v . Therefore, they can be 
written as 

d v ■= with 0<d v <N v , (14) 

aeA v 

such that the specific distance with respect to pattern £° is given by w° = 
T 2P d 

Because the differences between each of the patterns £ g within the subsets 
A„ of A are known (and recorded in the matrix p), it is sufficient to consider 
only the partial distances d v with respect to one pattern, here the partial 
distances d q v with respect to any other pattern can be expressed in terms of 
the d v as 

*i=Y,s(i-M=\t H Y: = ^ (15) 

\N v -d v ifp« = l, 
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using the matrix elements pi of p. 

The specific distance w q to any pattern £ 9 can be expressed as 

»' = EE 6(l-ffl v ,a a )= £ d v + E( N v-dv), (16) 

where the index set V = {1, ... , 2 P } of classes is partitioned into two subsets, 
V = V<? U with \/ 9 = {w|p« = 0} and If = {v\p% = 1}. 

Hence, by specifying the 2 P partial distances d v with respect to pattern £ , 
the specific distances with respect to any pattern £ q are determined, which 
in turn determine the fitness. This implies that all sequences with the same 
partial distances d v have the same fitness. Thus the partial distances d v to 
pattern £°, collected in a mutational distance vector d = (d v ) v= i,,,2P, shall be 
the quantities that label the classes in the lumped system. 

4-3 Lumping with respect to the partial distances 

The relevant partition of the sequence space is given by & = Ud©d with 
&d = {crlda- = d}, and the reduced sequence space, or mutational distance 
space S, contains the classes S = {d\0 < d v < N v , v = 1 . . . 2 P }. 

Considering again the subsequences er v , there are N v + 1 possible different d v 
(as d v takes values from to N v ), and there are Qf") different subsequences 
cr v for each d v . Hence, considering all sites, there are 

i5i = n(^+ i ) (17) 

v=l 

different d, and 




sequences cr that are mapped onto each d. For the patterns £ chosen as ex- 
ample (8), we have |<S| = 3 • 4 • 3 • 6 = 216, while the full sequence space has 
dimension \&\ = 2 12 = 4096. 

In the single step mutation model, the only neighbours of a sequence cr with 
distance vector d lie in the classes d± e v , where the e v = (S vw ) w= i,„2P are the 
unit vectors of mutation. Thus the only non-zero cumulative mutation rates 
are 

Ua v ■= u d ± ev)(T = Ma'a- (19) 

o-'ee d±ev 
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As a sequence with d has (N v — d v ) O-sites and d v 1-sites in A„, and the single- 
site mutation rates are fi/N for all sites, the cumulative mutation rates are 
given by 

u d v = [i (N v — d v ) /N for d — > d + e v and 

u d v = fid v /N for d ->• d - e v , (20) 

irrespective of the particular order within the subsequences cr v . Therefore the 
cumulative mutation rates u d v are the same for all sequences cr with the 
same d, which is the condition for lumping. The mutation-selection model 
with Hopfield-type fitness is indeed "lumpable" with respect to the partition 
induced by the distance vectors d. 

The mutation-selection process on the mutational distance space S is de- 
scribed by the lumped time-evolution operator H = R + M with lumped re- 
production and mutation matrices R and M of dimension |«S| x Whereas 
the lumped reproduction matrix R is still diagonal and contains the same 
entries as 71, i.e. Rd„ = Tier, the off-diagonal entries of the lumped muta- 
tion matrix M are given by the cumulative mutation rates with unchanged 
diagonal entries compared to M., which still fulfil the Markov property 
M dd = - J^d'^d M d'd, and thus 



M. 



did 



u^ v if d' = d + e v , 

u d v if d' = d — e v , 

-E v (u+ v + u d v ) = -fi if d' = d, 

otherwise, 



(21) 



with the cumulative mutation rates from equation (20). For a more gen- 
eral derivation of the lumped reproduc tion and mutation matrices, see 
Garske and Grimml (|2004bh : iGaTskel (|2005h . 



Note that the time-evolution operator H acting on S describes the evolution 
of a population under mutation and selection determined by the evolution 
equation (1), and thus the theory developed in section 2 applies. 



5 The maximum principle 



Although the lumping procedure reduces the number of types very efficiently, 
the evaluation of the eigenvalues and eigenvectors of the time-evolution op- 
erator H still remains a difficult problem for many applications, due to the 
size of the eigenvalue problem. If one is interested solely in the equilibrium 
behaviour of the system, however, it is possible to determine the population 
mean fitness (at least asymptotically for large sequence length N), given by 
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the leading eigenvalue of H. This can be done by a simple maximum principle 
that can be derived from Rayleigh's general maximum principle, which spec- 



ifies that the leading eigenvalue A r 
via a maximisation over R™ 



of an n x n matrix H can be obtained 



A, 



v T Hv 

SUp 7f 



(22) 



The vector v for which the supremum is attained is the eigenvector corre- 
sponding to the eigenvalue A max . The simple maximum principle derived from 
this guarantees that the population mean fitness f can be obtained by max- 
imising a function on the mutational distance space S. It can be shown that 
the maximiser itself is the ancestral mean mutational distance. 



Such a maximum principle has first been derived by Hermisson et al. ( 20021 ) 
for two-state sequences with permutation-invariant fitness. This has been gen- 
eralised to apply for four-sta te sequences with permutation-invariant fitness by 
Garskc and Grimm (|2004bi) . and subsequentl y the restriction to permutation- 
invar iant fitness fn nc~as been relaxed hv lRaalw et al l . The results 

from iRaake et al.1 J2005) apply directly to the Hopfield-type fitness treated 
here. 



5. 1 Symmetrisation of M 



Whereas the original mutation matrix M. is symmetric, the lumped muta- 
tion matrix M is no longer symmetric, as different numbers of sequences are 
lumped into the different classes, therefore giving rise to unequal cumulative 
forward and backward mutation rates. To derive the maximum principle, it is 
necessary to symmetrise the mutation matrix M . 

M is reversible, i.e., 

M M >7r d > = M d/d ir d , (23) 

where 7r = (7r d ) d£ s is the stationary distribution of the pure mutation pro- 
cess, which is given by the equidistribution of types on (5, and thus given 
by the number of sequences n d that are lumped onto the same mutational 
distance vector d. The reversibility of M implies that it can be symmetrised 
by the means of a diagonal transformation II := diagjTTd}, which yields the 
symmetrised mutation matrix as 

M = r 1/2 Mn 1/2 , (24) 

with off-diagonal entries 

M M = M M y/vr^/TTd = M d > d ^ d ~i7 d , = ^M dd! M d , d = M d , d . (25) 
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Using the cumulative mutation rates u d , this reads 

Md'd := uf = uH ev = y/itfu^ if d' = d ± e v and otherwise, (26) 

as can be seen by using the explicit representation for the cumulative muta- 
tion rates from equation (20) with the n d from equation (18). Because II is 
diagonal, the diagonal entries of the mutation matrix are unchanged, 

M dd = M dd = -n (27) 

As R is diagonal as well, it is not changed by the transformation II 1 / 2 , and 
thus this transformation also symmetrises the time-evolution operator such 
that 

H = n 1/2 HU 1/2 = R + M (28) 

is symmetric. 

Before symmetrisation, H was expressed as the sum of a Markov generator 
M and a diagonal remainder R. As the transformation II does not preserve 
the Markov property, this is not the case for the symmetrised time-evolution 
operator in (28). It is however useful to split it up this way. To this end, let 

H = E + F, (29) 

where F is a (symmetric) Markov generator and E is the (diagonal) remainder. 
The off-diagonal entries of F are given by those of M from equation (26), 
Fd'd — M d > d for d' 7^ d, whereas the Markov property requires as diagonal 
entries 

F dd = -J2 F d ' d = - E + V) 

The remainder E is given by 

E d = R d + M dd - F M 

= Rd-T, + V) + E (5? + V) 



5.2 Continuum approach for the limit of infinite sequence length 

To deal with the case of infinite sequence length, it will prove useful to use 
intensively scaled normalised versions of the extensively scaled variables like 
the mutational distances. The pattern in the Hopfield model, previously char- 
acterised by the number of sites N v in each subset A„, will now be described by 



= -E(v / ^^ + vV<4-J 

(30) 
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the fraction of sites in A„, given by X v := N v /N. Similarly, we use normalised 
partial distances 

x v := d v /N v , (32) 

where x v G [0,1], with the normalised mutational distance vector x = 
(x v )v=i...2p- The permutation-invariant model is obtained in the case p — 0. 

For finite N, x takes rational values in a normalised version of the mutational 
distance space • S C T>, where T> is a compact domain in M, 2P . For N — > oo, 
the vectors cc become dense in D. 

Assume that the entries of H — E + F can be approximated by functions 
e and / from C^(T>, M), i.e., twice continuously differentiable functions with 
bounded second derivatives that map V onto M such that 

E d = e(x d ) + O (1) , (33) 
Fj d = U(x d ) + o(±} , (34) 

where A = d! — d and the notation x d is used to emphasise that the nor- 
malised mutational distance x corresponding to a particular d is meant. In 
fact, assumption (34) can readily be verified for the cumulative mutation rates 
from equation (20): 



Let the functions f/±{x) be 

/a (a;) = < 

where the functions u v {x) are given by 



'u v (x d ) if A = ±e„, 

-2Eu v (x d ) ifA = 0, (35) 
otherwise, 



u v (x) := ^u +v (x)u- v (x) , (36) 

with the cumulative mutation rates u ±v (x d ) := u^", which read explicitly 

u +v (x) = /iX v (l — x v ) and u~'°(x) = /iX v x v . (37) 

Using a Taylor approximation, it can be shown that the differences between the 
exact entries of F as given in equations (26) and (30), and their approximations 
from equation (35), are indeed of O (^)- 

Assuming that also the reproduction rates R d can be approximated by a 
function r(x) as 

R d = r(x d ) (38) 
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then from equation (31), the matrix E is approximated by 



e{x) = r{x) - (u +v (x) + u~ v {x) - 2u v {x)j , (39) 

V 

fulfilling equation (33). With the definition of the mutational loss function g 



as 



g {x) : = (u +v (x) + u~ v (x) - 2u v (x] 

v 



v=l 



1 - 2\ xJl - x v ) 



this reads explicitly 



e(x) = r(x) — g(x) . 



(40) 
(41) 

(42) 



We are now in a position to apply theorems 1 and 2 from Baake et al. (|2005| ). 
which read for the mutation-selection model with Hopfield-type fitness con- 
sidered here 

Theorem 1 (The maximum principle). 

(i) Assume that for the lumped mutation-selection model as set up in section 
4 it is possible to approximate the reproduction rates Ra by a C% function 
r(xd) as specified in equation (38), and that the C% function e assumes its 
global maximum in the interior int(P). Then the population mean fitness in 
equilibrium is given by 



r = sup [r(x) -g{x)} + O 



(43) 



(ii) Assume furthermore that e assumes its maximum at a unique point x* 
in int(X>), and that the Hessian of e at x* is negative definite. Then in the 
limit of N —> oo, the maximiser x* is given by the mean ancestral mutational 
distance x, and in particular 



r = r(x) — g(x) . 
For a proof the reader is referred to iBaake et al. (2005). 



(44) 



6 Error thresholds 



The maximum principle (43) and (44) is a powerful tool to calculate the 
population mean fitness f in equilibrium for arbitrary fitness functions of the 
permutation-invariant or Hopfield type, for any range of mutation rates. Also, 
the ancestral mean genotype x is available. The general method to identify 
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f and x is to consider the partial derivatives of r — g with respect to the 
components x v of the mutational distance x. A necessary condition for the 
function r — g to have a maximum at a value x* is that its derivatives at this 
x* vanish, 

-?L[r(x)-g(x)} x=x * = Vv,k. (45) 

The global maximum of the function r — g must lie on one of the points x* 
that fulfil equation (45) or on the border of the mutational distance space. 
Thus by comparing the values of r — g on these possible points, the global 
maximum can be identified. 

Apart from the general possibility to investigate the dependence of the pop- 
ulation mean fitness f on the mutation rate /i, this yields the opportunity 
to investigate the phenomenon of the erro r threshold, w hich has interested 



scientists ever since it was first conceived bv Eigenl (Jl97l|). 



The phenomenon of the error threshold can be described as the existence of a 
critical mutation rate, below which the equilibrium population is well localised 
in sequence space, whereas for mutation rates above the critical mutation rate, 
the equilibrium population is more delocalised, with a sharp transition between 
the two phases. 

One problem is that there is no generally accepted definitio n of an eiror thresh- 



old. The criterion used in the original quasispecies model (jEigenl . 119711 ) is the 
disappearance of the wild-type from the population, which under the single 
peaked landscape used there goes in line with the complete delocalisation of 
the population in sequence space. However, these two effects do not necessarily 
coincide for other fitness landscapes. 

The definition of the error threshold that shall be used here is equivalent to 
the definition of a phase transition in physics, differentiating between first and 
second order transitions as follows: 

Definition (First and second order error threshold). 

(i) A first order error threshold exists at a critical mutation rate fi c , if the 
ancestral mean mutational distance as a function of the mutation rate x(fi) 
shows a discontinuity at this fj, c , which is also reflected by a kink in the pop- 
ulation mean fitness f(/x). 

(ii) A second order error threshold exists at a critical mutation rate /i c , if 
the ancestral mean mutational distance is continuous, but its derivative with 
respect to the mutation rate 

fJ>c- 



dx 

d/j, 



is discontinuous at this mutation rate 



In the examples shown later in this thesis, the second order error threshold 
always show an infinite derivative at the critical mutation rates. Note that, 
like phase transitions in physics, these definitions of the error thresholds apply 
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in the strict sense only to a system with infinite sequence length (N — > oo), 
for finite sequence lengths, the thresholds are smoothed out due to the lack of 
non-analyticities. 



Hermisson et al. ( 2002j ) gave a finer classification of different error threshold 
phenomena. The first order error threshold they called "fitness threshold". 
Here, this term shall include also the second order error threshold, making all 
error thresholds as defined above fitness thresholds. Furthermore, the concept 
of the "degradation threshold" was introduced: 

Definition (Degradation threshold). 

A degradation threshold is an error threshold of first or second order, where 
the population distribution beyond the critical mutation rate /i c is given by 
the equidistribution in sequence space &. 

Thus here the degradation threshold is a special case of a fitness threshold, 
going in line with the complete delocalisation of the population in sequence 
space. Note that in the limit of infinite sequence length (N —>■ oo), for which 
the error threshold definitions apply exactly, this equidistribution is reached 
immediately above fi c , and beyond the threshold the population is insensitive 
to any further increase in mutation rates. In the case of finite sequence lengths, 
where the thresholds are smoothed out, the equidistribution is of course only 
reached asymptotically. 

The original error threshold was observed for the single peaked fitness land- 
scape, where a single sequence is attributed a high fi tness value, all oth er se- 
quences are equally disadvantageous (for a review, see Eigen et al. . 1989f ). This 
is clearly an oversimplification and should not be regarded as anything but a 
toy model. Other fitness landscapes that have been investigated comprise, in 
the permutation-invariant case, linear and quadratic fitness functions, general 
functions showing epist asis, and as examp l es lacking permutat i on-iny ariance 



the Onsager landscape (IBaake et all Il997t iBaake and Wagnerl 120011 1 . which 



has nearest neighbour interactions within the s equence, as w e ll as various spin 
glass landscapes like the Hopfield landscape dLeuthaussei . 1987 : Tarazona . 
1992), the Sherringto n-Kirkpatrick spin gla ss ((Bonhoeffer and Stadlerl . Il993| ). 



the NK spin glass ( Campos et al. . 20021) 



(|Franz et all [1993; 

to each sequence. 



Franz and Pelitil . |1997) 



and the random energy model 
assigning random fitness values 



One fitnes s landscape where an analy t ical solution can be obtai ned is the linear 
fitness (cf . iRumschitzkvl . ElU iHiggsl Eaijl IBaake et all EaaJ . Note that this 
corresponds to a multiplicative landscape in a set-up using discrete time. For a 
linear fitness function, there is no error threshold, but the population changes 
smoothly from localised to delocalised with an increasing mutation rate. 



For quadratic fitness functions, error thresholds only exist for antagonistic 
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epista sis; they are absent for quadratic fitness functions with synergistic epis - 
tasis ((Baake et all Il997t iHermisson et all l2001t iGarske and Grimml . l2004ar l . 
These results go in line with those for general epistatic fitness functions 
flWiehel Il997h . Studies using non-permutation-invariant fitness functions gen- 
erally report the presence of error thresholds. 



Of course the discussion of the error threshold phenomenon is academic if 
the threshold is an artifact of the model rather than a real biological phe- 
nomenon. This issue has been subject to numerous debates, especially be- 
cause it has first been predicted by a model using the over-simplistic sin- 
gle peaked landscape. However, over the years biologists have accumulated 
evidence that particularly RNA vir u ses n a turally thrive at very high mu- 
tation rates ([Domingo and Holland! . Il988l lEieren and Biebricherl. Il988h. o f 



the order of 10 4 to 10 5 per base per replication ((Domingo et all [1996) 



corresponding t o a genomic mutation rate of about 0.1 to 10 mutations 
per replication (D omingo and Holland! . Il997l ). and a number of studies have 
reported that populations of RNA viruses only survive a moderate in- 
crease of their mutation rate, whereas i f the mutation rate is increased 
further, the popu l ations become extinct dHolland et al. . 19901 Loeb et al 



99i ISierra et all 1200(1 ICrottv et all l200lh . for reviews see iDommgo et aL 
( 199fih7 l Domingo and Holland! (|l997t) . This corresponds to the population be- 



ing pushed beyond the error thre shold. I t has been suggested to use the error 
threshold for anti- viral therapies ([Eigen , and in fact, recent experimen- 

tal results indicate that this is t he mechanism via wh ich the broad-spectrum 



anti- viral drug ribavirin works (|Crottv et all 120011 ). This clearly warrants 



some further investigation of the error threshold phenomenon, which shall 
be done in the remainder of this section. 



6.1 Different Hop field-type fitness functions 



The original Hopfield fitness as introduced by iHopfie 3 (|l982l) is a quadratic 
function of the specific distances and reads 



q=0 



E(y q ) 

9 =0 



(46) 



using normalised specific distances y q := w q /N, similarly to the normalised 
mutational distan ces x. The statistica l properties o f this l andscape have been 



Il985al lbl iTalagrandl . fcZQOah : In the thermody- 



studied in detail ([Amit et al 
namic limit A" — > oo, there are 2{p + 1) global maxima that are associated 
with the patterns (£ g ) and their complements (1 — £ q ). In addition to that, 
the number of local maxima and saddle points grows exponentially with the 
number patterns p + 1, hence the ruggedness of the fitness landscape can be 
tuned by the number of patterns. 



18 



Most works that have studied a Hopfield-type fi tness used th e original Hopfield 
model, a generalisation was however treated by Peliti (200^), using a Hopfield- 



type truncation selection with two patterns. Thus it might be interesting and 
instructive to investigate the threshold behaviour of different kinds of Hopfield- 
type fitness functions. 

Ap plying criteria for the ex istence of error thresholds that have been obtained 



Ap plying criteria tor tne ex istence ot error tnresnolds mat nave been obtained 
by iHermisson et al. (2002) for permutation-invariant fitness functions to the 



case of a Hopfield-type fitness, it can be shown that for linear Hopfield-type 
fitness functions there are no error thresholds, which is a new result, consid- 
ering that for all previously investigated Hopfield-type fitness functions, the 
existence of error thresholds was reported. 

The next step towards more complex fitness functions is to consider quadratic 
fitness functions, generalising the original Hopfield-fitness, which is a particu- 
lar example for a quadratic function. Here, the analysis shall be restricted to a 
symmetry with respect to the normalised specific distances y q to the patterns 
£ 9 , such that 

r = cf>*±£(y«) a • (47) 

g=0 q=0 

The parameter c tunes the linear in relation to the quadratic term, and the 
sign of the quadratic term determines the epistasis, a measure for the strength 
of interaction between sites. For a positive quadratic term epistasis is said to 
be negative or antagonistic, whereas for a negative quadratic term one speaks 
of positive or synergistic epistasis. The case c = — 1 combined with a positive 
quadratic term (i.e., negative epistasis) yields the original Hopfield fitness. 



6.2 Quadratic symmetric Hopfield-type fitness with two patterns 



In the case of two patterns, p — 1, the first pattern can be chosen without loss 
of generality as £° = 00 ... 0, such that there is only one pattern to be chosen, 
usually randomly. The matrix p containing the possible types of sites is given 
by 

/ n n \ 

(48) 

and thus the index set of sites is partitioned into two subsets, A = A x U A 2 , 
where A x contains all sites at which both patterns have entry 0, whereas A 2 
corresponds to the sites where the two patterns have entries and 1, respec- 
tively. The only quantities characterising the patterns are now the fractions 
of sites in each partition, X\ = Ni/N and X 2 = N2/N = 1 — X\. Thus the 
pattern can be characterised by a single parameter, X%. 
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X l = X 2 = 1/2 



X l = l-X 2 = 0.65 



r -0.4 \ 
-0.5 



-0.3 




Fig. 2. Original Hopfield fitness in the case of two patterns with different X\,X 2 . 

Each sequence is characterised with respect to the pattern by the partial 
Hamming distances to pattern £° (in normalised form), X\ and x 2 . These vary 
from (all entries in A^) to 1 (all entries 1 in A„), completely independently 
from each other. The specific distances with respect to the patterns are linear 
combinations of the x v and given in normalised form by 



The Hopfield-type fitness is defined as an arbitrary function of these patterns, 
r = fin ,!/ 1 )- Due to the small number of variables, for the case of two pat- 
terns, a lot can be done by analytical treatment. 

For the quadratic symmetric Hopfield-type fitness (47) with positive epistasis 
(i.e., negative sign of the quadratic term) and c = 1, there are no phase 
transitions, going in line with the results for permutation-invariant fitness 
functions, but different from other results for Hopfield-type fitness functions. 
As an example for negative epistasis, consider first the original Hopfield fitness 



6.2.1 The original Hopfield fitness with two patterns 

Since for two patterns there are only two variables, it is possible to visualise 
the fitness landscape in this case. Figure 2 shows the original Hopfield fitness 
(46) for the cases X ± — X 2 — 1/2 and X 1 = 1 — X 2 = 0.65. In the corners of 
the mutational distance space S, one can see the four degenerate maxima. 

The ancestral mean partial distances x v , at which the maxima of r — g are 
positioned, are obtained by considering the derivatives of r — g. They are given 



y ° = W °/N = X lXl + X 2 x 2 




(49) 



(46). 



by 




for n < X,, 
for pL> X, 



(50) 
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Fig. 3. Ancestral mean partial distances x v (top) and specific distances y q (bottom) 
depending on mutation rates. The original Hopfield fitness (46) for two patterns has 
been used. Results correspond to uncorrelated patterns {X\ = X 2 = 1/2, left) and 
correlated patterns with with X\ = 1 — X 2 = 0.65 (right). 



For the case of Xi = X 2 = 1/2, which corresponds to two completely uncor- 
related patterns, the ancestral mean partial distances x v are shown in figure 
3 on the left, alongside the ancestral mean specific distances y q = y q (x v ). For 
low mutation rates, there are two possible solutions for each of the ancestral 
mean partial distances x v , and as the maxima are degenerate, in equilibrium, 
the population will be centred equally around all of them. However, in the 
approach to equilibrium, the population might well be predominantly con- 
centrated around one of them, depending on initial conditions. The specific 
distances y q that are shown correspond to the combination of x\ and x 2 , where 
both are given by the lower branch. Other combinations yield similar results. 
For high mutation rates, the population is in the mutation equilibrium with 
x\ = x 2 = 1/2, forming a disordered phase. In the limit of low mutation rates, 
H — > 0, the population is always in the vicinity of one of the patterns (or its 
complement), such that one of the y q ~ (1), which is completely random 
with respect to the other pattern, and thus the other y q = 1/2. This is the 
ordered phase. At the critical mutation rate /i c = 1/2, there is a second or- 
der phase transition between these two phases, which is a fitness as well as a 
degradation threshold, corresponding to the infinite derivative of both x v at 
this mutation rate. As the specific distances y q are simply superpositions of 
the partial distances x v , the phase transitions are also visible in the y q . 
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In the correlated case X\ ^ X 2 (figure 3, right), two second order transitions 
can be identified. At // = Xi, X\ has a phase transition, whereas at fi = X 2 , 
x 2 has a phase transition. The threshold occurring at the lower mutation rate 
is only a fitness threshold, whereas the one happening at the higher mutation 
rate is both a fitness and degradation threshold, leading to a totally random 
population. For < \i < min(Xi,X 2 ), the population is in an ordered phase, 
for minpTi, X2) < n < max(Xi, X 2 ), it is in a partially ordered phase, which 
is ordered with respect to one of the variables, but random with respect to 
the other. Finally, for /i > max(X 1 , X 2 ), the population is the equidistribution 
in sequence space. Here again, for low mutation rates the population is close 
to one of the patterns, but due to the correlation in the chosen patterns, this 
leads to a non-random overlap with the other pattern. In the uncorrelated 
case with X 1 = X 2 = 1/2, the two error thresholds coincide, and the partially 
ordered phase vanishes. 

6.2.2 Deviations from the original Hopfield model 

Now turn to the question how these phase transitions depend on the particular 
degeneracy of the fitness functions and consider the quadratic fitness function 
(47) with negative epistasis (i.e., positive quadratic term) for values of c 7^ —1. 

Figure 4 shows the fitness landscapes for values of c = —0.9 and c = —1.1 
for uncorrelated patterns Xi = X 2 = 1/2 and correlated patterns with 
Xi — 1 — X 2 = 0.65. As the pictures indicate, the fitness functions (and 
thus the behaviour of the system) with the same \c+ 1| are related by symme- 
try under x\ — > 1 — x\ (apart from a constant term, which does not influence 
the dynamics). Note that in ^-direction, the fitness function is independent of 
c. This is because in the sum of the specific distances, the term with x 2 cancels 
out, which happens generally in the case of an even number of patterns (i.e., 
odd p) for ( , J different variables. 



Because in ^-direction the fitness is independent of c, the solution for the 
ancestral mean mutational distance x 2 is identical with the solution for the 
original Hopfield fitness as given in equation (50). So for all values of c, the 
phase transition with respect to x 2 happens at /i — X 2 . For x±, the solution 
becomes more complicated, but the inverse function is simpler. It is given by 



The dependence of x 1 on the mutation rate is shown in figure 5 (top). For 
-1, the second order phase transition is smoothed out and thus vanishes. 
Note that the ambiguity in the solutions that exists in the case c = — 1 (cf. 
figure 3), does not exist here, due to the lacking degeneracy of the maxima 
of the fitness function at x± — and x\ — 1 (cf. figure 4). At the bottom, 




H = 




(51) 



2xi - 1 



22 



X l =X 2 =V2 Z 1 = l-Z 2 = 0.65 




Fig. 4. Quadratic Hopfield-type fitness functions (47) with negative epistasis and 
c = —0.9 (top) and c = —1.1 (bottom) for an uncorrelated pattern X\ = X2 = 1/2 
(left) and a correlated pattern with X\ = 1 — X2 = 0.65 (right). 



figure 5 shows the specific distances y q , using the lower branch of the solution 
for x 2 (as shown in figure 3), which show the second order transition in £2, a 
fitness threshold. With this combination of solutions, for low mutation rates 
the population is centred around the sequence complementary to pattern £ 1 . 
The general picture for the uncorrelated (Xi = 1/2) and correlated {X\ 7^ 1/2) 
choice of patterns is very similar, apart from issues like the exact location of 
the thresholds. 



6.3 Quadratic symmetric Hopfield-type fitness with three and more patterns 



The behaviour of the system with quadratic symmetric Hopfield-type fitness 
has been investigated for three, four and five patterns. However, due to the 
complexity of the analysis for a higher number of patterns, the focus is on the 
case of three patterns. 
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Fig. 5. Ancestral mean partial distance x\ (top) and specific distances y° (thick lines, 
bottom) and y 1 (thin lines, bottom) depending on mutation rates. The quadratic 
Hopfield-type fitness (47) with negative epistasis for two patterns and different val- 
ues of c has been used. Results correspond to uncorrelated patterns X\ = X 2 = 1/2 
(left) and correlated patterns with X\ = 1 — X 2 = 0.65 (right). Data are shown 
for parameter values of c = —0.9, —0.95, —0.99, —1.01, —1.05, —1.1 (top to bottom). 
For clarity, only specific distances y q corresponding to c > — 1 are shown. 

For three patterns, the matrix p reads 



^0 0^ 



11 
10 1 



(52) 



thus there are four X v describing the patterns £, fulfilling Yli=iX v = 1, and 
four variables x v G [0, 1], describing each sequence. The specific distances with 
respect to pattern £ q are given by 

y° = XlZi + X 2 x 2 + + X4X4 , (53) 
y 1 = X lXl + X 2 x 2 + X 3 (l - x 3 ) + X 4 (l - x 4 ) , (54) 
y 2 = X lXl + X 2 (l - x 2 ) + X3X3 + X 4 (l - x 4 ) . (55) 



Similarly to the case of two patterns, the original Hopfield fitness (46) shall be 
considered first, and then variations (47) with negative epistasis, but c ^ -1. 
The investigation is done by means of numerical calculations. 
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6. 3. 1 Variations of the pattern 



For the Hopfield-type fitness, the number of classes in the lumped system 
is given by \S\ = I\l=i(N v + 1). Now consider the case where the patterns 
are chosen randomly with equal probability for each letter at each site. This 
results in a multinom i al pro bability distribution for the number of sites in 
each subset (jwhittlel . Fl97fih . 



p({JVi ""< *>>=5nEbw' (56) 

where n = TP. The means are given by N v = N/n and the variance o 2 = 
N(n - l)/n 2 such that N v = N/n + 0(y/N). 

If the patterns £ 9 for q = l,...,p are chosen randomly (remember £° = 
00. . .0), the X v are thus given by X v = 2~ p + O (77^) • So to mimic the 
case of an infinite sequence length, in which the maximum principle is exact, 
one has to assume uncorrelated patterns with X v = 2~ p for all v = 1, . . . , TP . 
However, the maximum principle can also be used to investigate the case of 
finite sequence length by simulating the finite sequence length through choos- 
ing pattern distributions that do not follow exactly the infinite distribution 
X v = 2~ p , but vary around this mean value with a variance according to the 
sequence length to be simulated. Practically, patterns corresponding to finite 
sequence length N have been obtained by choosing for the 3N sites entries 
or 1 with probability 1/2 at each site, and counting the number of sites N v 
in each class A v , similarly to the example pattern given in equation (8). Thus 
although the patterns were chosen randomly, they are correlated due to the 
finite sequence length. 

As shall be seen in the following, these correlations do account for some in- 
teresting additional features. However, focus first on the case of a genuinely 
infinite sequence length with X v = 2 _p for all v. 



6.3.2 The case of infinite sequence length 

6.3.2.1 Original Hopfield fitness (c = —1). The results for the case 
of three, four and five patterns with original Hopfield fitness (46) and uncor- 
related patterns (X v = 2~ p for all v, corresponding to random patterns for 
infinite sequence length) look exactly like those for two patterns shown in 
figure 3 (left). 

The solutions for the different x v all coincide. For small mutation rates 
H < 1/2, there are again two degenerate solutions for each x v , which can 
be combined in multiple ways for the different x v , yielding any of the patterns 
or their complementary sequences. At ji = 1/2, there is a second order phase 
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c=-0.9 



c=-l.l 




Fig. 6. Ancestral mean partial distances x„ (top) and specific distances y q (bot- 
tom) depending on mutation rates. The quadratic Hopfield-type fitness (47) with 
negative epistasis for three patterns with c = —0.9 (left) and c = —1.1 (right) has 
been used. Results correspond to uncorrelated patterns for infinite sequence length 
{X v = 1/4 W). 

transition, which is a fitness and degradation threshold. For small mutation 
rates, the population is centred around one of the patterns, say £ 9c , and there- 
fore y qc < 1/2, whereas it is completely random with respect to the other 
patterns, yielding y q = 1/2 for q ^ q c . 

6.3.2.2 Deviations from the original Hopfield fitness (c ^ —1)- Fig- 
ure 6 shows the ancestral mean partial distances x v and specific distances y q 
for the quadratic Hopfield-type fitness (47) with negative epistasis for dif- 
ferent values of c in the case of three patterns. Data for four patterns look 
very similar (not shown). As c deviates from —1, the solutions for the x v do 
not coincide, and the phase transition becomes a first order fitness threshold, 
at which all four partial distances x v jump, but it is no more a degradation 
threshold. So contrary to the case of two patterns, where x-i is independent of 
c and the error threshold in X\ is smoothed out by c deviating from — 1, here 
the threshold concerns all four partial distances x v and is sharpened to first 
order by c ^ —1. 

For c ^ -1, the degeneracy between the patterns and their complements is 
lifted, and thus for mutation rates below the threshold, there are only p + 1 
different solutions, correlated with the patterns for c < — 1, and with their 
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Fig. 7. The critical mutation rate ji c depending on the value of c. The quadratic 
Hopfield-type fitness (47) with negative epistasis for three and four patterns has 
been used. Results correspond to uncorrelated patterns simulating infinite sequence 
length. 

complements for c > — 1. For clarity, only one of the solutions is shown in 
figure 6. 

Furthermore, the critical mutation rate decreases with increasing \c + 1|. The 
dependence of the critical mutation rate /i c on the value of c for p = 2, 3 is 
shown in figure 7. At c = —5/4 and c = —3/4, the critical mutation rate 
is fi c = 0, and for values of c ^ [—5/4,-3/4], there are no first order error 
thresholds for either p = 2 or p = 3. This goes in line with a different sequence 
becoming optimal at /i = for these values of c. 

However, for p = 3, there is an additional line of second order error thresholds, 
that approaches /x c = 1/4 as |c + 1| grows. Preliminary results for p = 4 
indicate, that in that case the second order line does not occur. It might 
thus be conjectured that the existence of the second order error threshold line 
depends on the number of patterns being even or odd (remember that for 
p — 1 it does exist). 

This is an interesting result, as for all previously investigated Hopfield-type 
fitness functions (which are limited to the original Hopfield fitness and a 
Hopfield-type truncation selection as far as the author is aware), the existence 
of error thresholds has been reported. 

6.3.3 The case of correlated patterns simulating a finite sequence length 

Figure 8 shows some cases of the ancestral mean partial and specific distances, 
x v and y q , for the original Hopfield fitness (46) with three patterns, which 
are randomly chosen sequences of finite length. The correlations between the 
patterns (and thus the variations of the X v ) are characteristic for the sequence 
length. The six cases of patterns shown here are typical examples for the 
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N= 1000 Af=300 

(0.252, 0.242, 0.243, 0.263) (0.223, 0.267, 0.267, 0.243) 



<* 0.5 




0> 0.5 r=T= 



N = 100 

(0.23,0.37,0.17,0.23) 




(0.263, 0.246, 0.247, 0.244) (0.23, 0.273, 0.246, 0.25) 



<h 0.5 



0.5 




(0.22, 0.31,0.3,0.17) 




Fig. 8. Ancestral mean partial and specific distances, x v and y q , depending on 
mutation rates. The original Hopfield fitness (46) for three patterns has been used. 
Results correspond to two typical examples of random, but correlated patterns 
chosen for sequences of lengths N = 1000 (left), N = 300 (middle) and N = 100 
(right), specified at the top of each graph as (Xi, X2, X3, X4). 
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sequence lengths considered. In the case of long sequences (N = 1000), the 
deviations of the patterns from the infinite sequence limit Xi = X 2 = X 3 = X 4 
are small, and grow with decreasing sequence length. These correlations that 
are introduced into the system have the same effect as a choice of correlated 
patterns in the case of two patterns, such that the single critical mutation rate 
in the case of infinite sequence length is split up into two critical mutation 
rates, at each of which two of the x v show threshold behaviour. For short 
sequence length (N = 100), it can be seen that, particularly at the smaller 
critical mutation rate, the threshold is smoothed out. 

In figure 9, the ancestral mean partial and specific distances, x v and y q , cor- 
responding to the same patterns as in figure 8 are shown for the quadratic 
Hopfield-type fitness (47) with negative epistasis and c = —1.1. For long 
sequence length, these look very similar to the results for infinite sequence 
length (cf. figure 6), showing clearly the single first order phase transition. For 
shorter sequence lengths, they become more and more smoothed out, such 
that at iV = 300, roughly only every other pattern that was simulated shows 
an error threshold, whereas for N = 100, in the vast majority of cases, there is 
no threshold. Note that this effect is present even though the finite sequence 
length was only simulated by choosing the patterns accordingly, so it is a 
feature of the model with correlated patterns. 



6.4 Summary of the results for Hopfield-type fitness 

In this section, the quadratic symmetric Hopfield-type fitness given in equation 
(47) with negative epistasis (i.e., positive quadratic term) was investigated for 
a small number of patterns. The results for the original Hopfield fitness (c = 
— 1) have been compared with those for the generalised quadratic Hopfield- 
type fitness (c^ 1). Furthermore, both uncorrelated patterns (X v = 2~ p for all 
v), corresponding to an infinite sequence length, and correlated patterns (X v 7^ 
2~ p ), simulating a finite sequence length, were considered. For two patterns, 
an analytical treatment was possible, making all values of the X v accessible, 
whereas the case of three or more patterns was treated numerically due to 
the larger number of variables. This means that apart from the uncorrelated 
choice of pattern (X v = 2~ p ), which was investigated for three, four and five 
patterns, only some correlated combinations for three patterns with X v 7^ 2~ p 
were investigated, some typical examples of which are shown in section 6.3. 
The results are summarised as follows: 

• Original Hopfield fitness (c = —1): 

• For uncorrelated patterns, there is one second order error threshold for all 
x v at /i c = 1/2 (investigated p = 1,2, 3, 4). 

• For correlated patterns, there are two second order error thresholds, each 
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Fig. 9. Ancestral mean partial and specific distances, x v and y q , depending on muta- 
tion rates. The Hopfield-type fitness (47) with negative epistasis for three patterns 
and c = —1.1 has been used. Results correspond to the patterns used in figure 8. 

for half of the x v (investigated p — 1, 2). 
• Hopfield-type fitness with c ^ — 1: 

• For uncorrelated patterns, there is a first order error threshold only on 
a restricted range of c (p = 1 : no first order threshold, p = 2, 3 : c G 
[-5/4,-3/4]). 
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For an even number of patterns (p = 1,3), there is an additional second 
order error threshold for any |c + 1| ^ Ac (p = 1 : Ac = 0, p = 3 : Ac m 
0.145). This error threshold does not exist for the investigated cases of an 
odd number of patterns (p = 2,4). 
• For correlated patterns, at p — 1, there is one second order threshold, the 
other one, which is present for the original Hopfield fitness, is smoothed 
out. At p = 2, there is up to one first order threshold, smoothed out 
for more strongly correlated patterns (corresponding to shorter sequence 
length). 

The evaluation the Hopfield-type fitness was limited to the cases of rather 
small numbers of patterns, simply because an increase in the number of pat- 
terns makes the evaluation more complex. However, the Hopfield-type fitness 
was chosen as a potentially realistic fitness because of its ruggedness that can 
be tuned by the number of patterns chosen. The simple cases considered here 
probably do not show as high a degree of ruggedness as one would expect for 
realistic fitness functions. However, the results described here already indicate 
some features that are common for all numbers of patterns investigated here, 
and some that depend on whether the number of patterns is odd or even. It 
would be very interesting to establish whether these results generalise to an 
arbitrary number of patterns. 

Furthermore, the concept of partitioning the set of sites into subsets, which was 
introduced to analyse the Hopfield-type fitness, is very interesting. One could 
imagine a different interpretation for this by classifying sites according to the 
selection strength they evolve under. Some of the behaviour identified for the 
Hopfield system could also occur in such a setting: At intermediate mutation 
rates, partially ordered phases could exist, such that sites that evolve under 
weak selection have passed their error threshold and the population is in a 
phase that is disordered with respect to these sites, whereas at sites that are 
subject to strong selection the order is still maintained. 



7 Conclusion 

The present work has been concerned with the investigation of a determinis- 
tic mutation-selection model in the sequence space approach, using a time- 
continuous formulation. Important observables in these mutation-selection 
models are the population and ancestral distributions p and a, and means 
with respect to these distributions, in particular the population mean fitness 
f and the ancestral mean genotype x. In equilibrium, p is given by the right 
Perron- Frobenius eigenvector of the time-evolution operator H, whereas the 
ancestral distribution is given by the product of both right and left PF eigen- 
vectors p and z, Qi — z,pi. 
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Types have been modelled as two-state sequences. As mutation model, the sin- 
gle step mutation model was used, whereas selection was modelled by Hopfield- 
type fitness functions, using the similarity of a sequence to a number of pat- 
terns to determine its fitness. This allows for a more rugged fitness landscape, 
and the complexity of the fitness can be tuned by the number of patterns. 

The large number of types that arise in the sequence space approach have 
been lumped into classes of types, labelled by the partial distances d v intro- 
duced in section 4 as a generalisation of the Hamming di stance. With this, 
the maximum principle as developed bv lBaake et al. (2005) can be ap plied to 



tne maximum principle as developed bv lrsaake et al. (zuuoj can be ap plied to 
the ca se of Hopfield-type fitness functions as done in section 5, see also lGarskel 



(2005). treating two- and four-state sequences. 



In section 6, the maximum principle derived in section 5 was used to inves- 
tigate the phenomenon of the error threshold. These error thresholds can be 
detected with the maximum principle, because the delocalisation of the popu- 
lation distribution manifests itself as a jump (or at least an infinite derivative 
with respect to the mutation rate) of the ancestral mean genotype x, the 
maximiser. Not all fitness functions give rise to error thresholds, and as the 
error thresholds were first described for a model with highly unrealistic fit- 
ness function, it has been argued that they might be an artifact of this rather 
than a biologically relevant phenomenon. It is therefore clearly necessary to 
investigate more complex fitness functions with respect to this phenomenon. 

Here, quadratic Hopfield-type fitness functions with small numbers of patterns 
have been investigated. For the original Hopfield fitness, the results for all in- 
vestigated numbers of patterns are identical. However, if the fitness differs 
from the original Hopfield fitness, different behaviours are observed for differ- 
ent numbers of patterns. In the case of uncorrelated patterns, corresponding 
to random patterns chosen for infinite sequence length, the observed features 
seem to depend on whether the number of patterns is odd or even. Because 
for correlated patterns, only the cases of two and three patterns were inves- 
tigated, and found to behave differently, it would be interesting to see how 
these results generalise to a higher number of patterns. 

In the original Hopfield fitness, error thresholds were observed for all choices 
of patterns. This is not true for a generalised Hopfield-type fitness. For in- 
stance, for a Hopfield-type fitness with positive epistasis no thresholds were 
observed, going in line with the results for permutation-invariant fitness. But 
also for Hopfield-type fitness functions with negative epistasis, there are not 
necessarily any thresholds, if the fitness deviates too much from the origi- 
nal Hopfield fitness, challenging the commonly held notion that more com- 
plex fitness functions all tend to display error threshold behaviour. The com- 
plexity and r uggedn ess of the original Hopfield fitness have been investigated 
f)Amit et all Il985al lbh and found to be good candidates for realistic fitness 
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functions ( Leuthausser . 1987 : Tarazona . 1992). However, these results do not 
necessarily transfer to the generalised Hopfield-type fitness functions, and 
therefore it would be very useful to study these properties of a generalised 
Hopfield-type fitness functions to analyse which of these factors are responsi- 
ble for generating the thresholds. 
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